Preprocessing and Visualizing Time Series in Python

\(\hspace{0.3cm}\) More articles: \(\hspace{0.1cm}\) Estadistica4all

\(\hspace{0.3cm}\) Author: \(\hspace{0.1cm}\) Fabio Scielzo Ortiz

\(\hspace{0.3cm}\) If you use this article, please reference it:

\(\hspace{0.5cm}\) Scielzo Ortiz, Fabio. (2023). Preprocessing and Visualizing Time Series in Python. Estadistica4all. http://estadistica4all.com/Articulos/Preprocessing_and_Visualizing_Time_Series_in_Python.html

It’s recommended to open the article in a computer or tablet.


1 Preprocessing Time Series in Python

First of all, we import some of the libraries we will be using throughout this article:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

We load a time series data-set, in this case the minimum periodicity of the time series is daily:

Time_Series_1 = pd.read_csv('Time_Series_1.csv')

We are going to expose some preprocessing time series task, not many but important.

When we have a time series, usually we will have a date column, in this data-set called Fecha.

When we have identified the date column we must convert it to date-time format. We can do it using Pandas as follows:

Time_Series_1['Fecha'] = pd.to_datetime(Time_Series_1['Fecha'])

Other important preprocessing task when we have a time series data-set is to create new columns with the day, week, month, quarter, and year associated to each observation.

We can carry out these tasks with Pandas:

Time_Series_1['Dia'] = Time_Series_1['Fecha'].dt.day

Time_Series_1['Semana'] = Time_Series_1['Fecha'].dt.week

Time_Series_1['Mes'] = Time_Series_1['Fecha'].dt.month

Time_Series_1['Trimestre'] = Time_Series_1['Fecha'].dt.quarter

Time_Series_1['Año'] = Time_Series_1['Fecha'].dt.year

We select the specific columns we will use throughout this article. The response variable will be the Ventas column, and the rest are columns related with the observation date.

Time_Series_1 = Time_Series_1.loc[: , ['Fecha', 'Dia', 'Semana', 'Mes', 'Trimestre', 'Año', 'Ventas']]
Time_Series_1
Fecha Dia Semana Mes Trimestre Año Ventas
0 2022-06-21 21 25 6 2 2022 59.99
1 2021-08-03 3 31 8 3 2021 12.72
2 2022-08-21 21 33 8 3 2022 11.20
3 2022-07-09 9 27 7 3 2022 48.97
4 2022-05-14 14 19 5 2 2022 22.95
704350 2022-02-10 10 6 2 1 2022 18.90
704351 2022-06-02 2 22 6 2 2022 -29.99
704352 2022-05-19 19 20 5 2 2022 75.00
704353 2022-02-17 17 7 2 1 2022 -10.70
704354 2021-06-06 6 22 6 2 2021 6.95

704355 rows × 7 columns


2 Visualizing Time Series in Python

In this section we are going to visualize our time series using different periodicities.

2.1 Monthly Time Series

We use groupby to group the time series by year and month, and then calculate the sum of the Ventas column for each of those groups.

Monthly_Time_Series_1 = Time_Series_1.groupby(['Año', 'Mes'])['Ventas'].sum().reset_index(drop=False)

We change the Año (year) and Mes (month) columns to string format, and then we create the Mes-Año (month-year) column. All this is needed for the posterior visualization.

Monthly_Time_Series_1['Año'] = Monthly_Time_Series_1['Año'].astype('string')
Monthly_Time_Series_1['Mes'] = Monthly_Time_Series_1['Mes'].astype('string')

Monthly_Time_Series_1['Mes-Año'] = Monthly_Time_Series_1[['Mes', 'Año']].agg('-'.join, axis=1)
Monthly_Time_Series_1
Año Mes Ventas Mes-Año
0 2021 6 992891.50 6-2021
1 2021 7 982142.75 7-2021
2 2021 8 885289.16 8-2021
3 2021 9 878563.14 9-2021
4 2021 10 923552.45 10-2021
5 2021 11 1527486.61 11-2021
6 2021 12 1438722.72 12-2021
7 2022 1 1910816.46 1-2022
8 2022 2 1317501.34 2-2022
9 2022 3 1524652.47 3-2022
10 2022 4 2060509.71 4-2022
11 2022 5 2316733.47 5-2022
12 2022 6 2872219.23 6-2022
13 2022 7 2731251.02 7-2022
14 2022 8 1844222.02 8-2022
15 2022 9 1068975.95 9-2022
16 2022 10 895735.29 10-2022
17 2022 11 1544511.61 11-2022
18 2022 12 1272814.69 12-2022


Now we create a monthly time series plot using Seaborn.

fig, ax = plt.subplots(figsize=(13,8))

p=sns.lineplot(x="Mes-Año", y="Ventas", data=Monthly_Time_Series_1 , color='red')

plt.setp(p.get_xticklabels(), rotation=90)

plt.title("Monthly Time Series",  fontsize = 17)

plt.tight_layout()

fig.savefig('p1.jpg', format='jpg', dpi=500)

plt.show()


2.2 Daily Time Series

We use groupby to group the time series by year, month and day, and then calculate the sum of the Ventas column for each of those groups.

Daily_Time_Series_1 = Time_Series_1.groupby(['Año', 'Mes','Dia'])['Ventas'].sum().reset_index(drop=False)

We change the Año (year), Mes (month) and Dia (day) columns to string format, and then we create the Mes-Año-Dia (month-year-day) column. All this is needed for the posterior visualization.

Daily_Time_Series_1['Año'] = Daily_Time_Series_1['Año'].astype('string')
Daily_Time_Series_1['Mes'] = Daily_Time_Series_1['Mes'].astype('string')
Daily_Time_Series_1['Dia'] = Daily_Time_Series_1['Dia'].astype('string')

Daily_Time_Series_1['Dia-Mes-Año'] = Daily_Time_Series_1[['Dia', 'Mes', 'Año']].agg('-'.join, axis=1)
Daily_Time_Series_1
Año Mes Dia Ventas Dia-Mes-Año
0 2021 6 1 26423.78 1-6-2021
1 2021 6 2 18752.01 2-6-2021
2 2021 6 3 22812.84 3-6-2021
3 2021 6 4 107889.11 4-6-2021
4 2021 6 5 136714.44 5-6-2021
574 2022 12 27 64542.49 27-12-2022
575 2022 12 28 59913.84 28-12-2022
576 2022 12 29 53815.43 29-12-2022
577 2022 12 30 52695.32 30-12-2022
578 2022 12 31 39739.67 31-12-2022

579 rows × 5 columns


Now we create a daily time series plot using Seaborn.

fig, ax = plt.subplots(figsize=(13,8))

p=sns.lineplot(x="Dia-Mes-Año", y="Ventas", data=Daily_Time_Series_1 , color='red')

p.set_xticks(np.arange(0 , len(Daily_Time_Series_1) , 40))

plt.setp(p.get_xticklabels(), rotation=90)

plt.title("Daily Time Series", fontsize = 17)

plt.tight_layout()


fig.savefig('p2.jpg', format='jpg', dpi=500)

plt.show()


2.3 Weekly Time Series

We use groupby to group the time series by year, month and week, and then calculate the sum of the Ventas column for each of those groups.

Weekly_Time_Series_1 = Time_Series_1.groupby(['Año', 'Mes','Semana'])['Ventas'].sum().reset_index(drop=False)

We change the Año (year), Mes (month) and Semana (week) columns to string format, and then we create the Semana-Mes-Año (week-month-year) column. All this is needed for the posterior visualization.

Weekly_Time_Series_1['Año'] = Weekly_Time_Series_1['Año'].astype('string')
Weekly_Time_Series_1['Mes'] = Weekly_Time_Series_1['Mes'].astype('string')
Weekly_Time_Series_1['Semana'] = Weekly_Time_Series_1['Semana'].astype('string')

Weekly_Time_Series_1['Semana-Mes-Año'] = Weekly_Time_Series_1[['Semana', 'Mes', 'Año']].agg('-'.join, axis=1)
Weekly_Time_Series_1
Año Mes Semana Ventas Semana-Mes-Año
0 2021 6 22 329029.82 22-6-2021
1 2021 6 23 158833.59 23-6-2021
2 2021 6 24 201568.21 24-6-2021
3 2021 6 25 196610.74 25-6-2021
4 2021 6 26 106849.14 26-6-2021
94 2022 12 48 89144.45 48-12-2022
95 2022 12 49 301484.19 49-12-2022
96 2022 12 50 308829.59 50-12-2022
97 2022 12 51 262326.49 51-12-2022
98 2022 12 52 311029.97 52-12-2022

99 rows × 5 columns


Now we create a weekly time series plot using Seaborn.

fig, ax = plt.subplots(figsize=(13,8))

p=sns.lineplot(x="Semana-Mes-Año", y="Ventas", data=Weekly_Time_Series_1 , color='red')

p.set_xticks(np.arange(0 , len(Weekly_Time_Series_1) , 5))

plt.setp(p.get_xticklabels(), rotation=90)

plt.tight_layout()

plt.title("Weekly Time Series",  fontsize = 17)

fig.savefig('p3.jpg', format='jpg', dpi=500)

plt.show()


2.4 Quarter Time Series

We use groupby to group the time series by year, month and week, and then calculate the sum of the Ventas column for each of those groups.

Quarter_Time_Series_1 = Time_Series_1.groupby(['Año', 'Trimestre'])['Ventas'].sum().reset_index(drop=False)

We change the Año (year) and Trimestre (quarter) columns to string format, and then we create the Trimestre-Año (quarter-year) column. All this is needed for the subsequent visualization.

Quarter_Time_Series_1['Año'] = Quarter_Time_Series_1['Año'].astype('string')
Quarter_Time_Series_1['Trimestre'] = Quarter_Time_Series_1['Trimestre'].astype('string')

Quarter_Time_Series_1['Trimestre-Año'] = Quarter_Time_Series_1[['Trimestre', 'Año']].agg('-'.join, axis=1)
Quarter_Time_Series_1
Año Trimestre Ventas Trimestre-Año
0 2021 2 992891.50 2-2021
1 2021 3 2745995.05 3-2021
2 2021 4 3889761.78 4-2021
3 2022 1 4752970.27 1-2022
4 2022 2 7249462.41 2-2022
5 2022 3 5644448.99 3-2022
6 2022 4 3713061.59 4-2022
fig, ax = plt.subplots(figsize=(13,8))

p=sns.lineplot(x="Trimestre-Año", y="Ventas", data=Quarter_Time_Series_1 , color='red')

p.set_xticks(np.arange(0 , len(Quarter_Time_Series_1) , 1))

plt.setp(p.get_xticklabels(), rotation=90)

plt.title("Quarter Time Series",  fontsize = 17)

plt.tight_layout()

fig.savefig('p4.jpg', format='jpg', dpi=500)

plt.show()


2.5 Yearly Time Series

Time_Series_2 = pd.read_csv('Time_Series_2.csv')
Time_Series_2['Fecha'] = pd.to_datetime(Time_Series_2['Fecha'])

Time_Series_2['Dia'] = Time_Series_2['Fecha'].dt.day

Time_Series_2['Semana'] = Time_Series_2['Fecha'].dt.week

Time_Series_2['Mes'] = Time_Series_2['Fecha'].dt.month

Time_Series_2['Trimestre'] = Time_Series_2['Fecha'].dt.quarter

Time_Series_2['Año'] = Time_Series_2['Fecha'].dt.year
Time_Series_2.loc[:,['Importe', 'Fecha', 'Dia', 'Semana', 'Mes', 'Trimestre', 'Año']]
Yearly_Time_Series_2['Año'] = Yearly_Time_Series_2['Año'].astype('string')
Yearly_Time_Series_2 = Time_Series_2.groupby(['Año'])['Importe'].sum().reset_index(drop=False)
Yearly_Time_Series_2
Año Importe
0 2018 20584.60
1 2019 52704.27
2 2020 81690.09
3 2021 97683.53
4 2022 95286.31
fig, ax = plt.subplots(figsize=(13,8))

p=sns.lineplot(x="Año", y="Importe", data=Yearly_Time_Series_2 , color='red')

p.set_xticks(np.arange(0 , len(Yearly_Time_Series_2) , 1))

plt.setp(p.get_xticklabels(), rotation=90)

plt.title("Yearly Time Series",  fontsize = 17)

plt.tight_layout()

fig.savefig('p5.jpg', format='jpg', dpi=500)

plt.show()


2.6 Box-plot by month

Time_Series_1_copy = Time_Series_1.copy()
Time_Series_1_copy['Año'] = Time_Series_1_copy['Año'].astype('string')
Time_Series_1_copy['Mes'] = Time_Series_1_copy['Mes'].astype('string')

Time_Series_1_copy['Mes-Año'] = Time_Series_1_copy[['Mes', 'Año']].agg('-'.join, axis=1)
Time_Series_1_copy
Fecha Dia Semana Mes Trimestre Año Ventas Mes-Año
0 2022-06-21 21 25 6 2 2022 59.99 6-2022
1 2021-08-03 3 31 8 3 2021 12.72 8-2021
2 2022-08-21 21 33 8 3 2022 11.20 8-2022
3 2022-07-09 9 27 7 3 2022 48.97 7-2022
4 2022-05-14 14 19 5 2 2022 22.95 5-2022
704350 2022-02-10 10 6 2 1 2022 18.90 2-2022
704351 2022-06-02 2 22 6 2 2022 -29.99 6-2022
704352 2022-05-19 19 20 5 2 2022 75.00 5-2022
704353 2022-02-17 17 7 2 1 2022 -10.70 2-2022
704354 2021-06-06 6 22 6 2 2021 6.95 6-2021

704355 rows × 8 columns

from datetime import datetime

# Definir el array de fechas:

Mes_Año = Time_Series_1_copy['Mes-Año'].unique()

# Convertir las fechas a objetos datetime y almacenarlos en una nueva lista:

Mes_Año_dt = [datetime.strptime(f, '%m-%Y') for f in Mes_Año]

# Ordenar la lista de fechas datetime:

Mes_Año_dt_sorted = sorted(Mes_Año_dt)

# Convertir los objetos datetime ordenados de nuevo a una lista de cadenas con formato "mes-año":

Mes_Año_dt_sorted = [datetime.strftime(f, '%m-%Y') for f in Mes_Año_dt_sorted]

# Eliminar los ceros iniciales:

Mes_Año_dt_sorted_sin_ceros = [x.lstrip('0')  if x.startswith('0')  else x   for x in Mes_Año_dt_sorted]
fig, ax = plt.subplots(figsize=(13,7))

p1 = sns.boxplot(x="Mes-Año", y="Ventas", data=Time_Series_1_copy, order=Mes_Año_dt_sorted_sin_ceros, showfliers=True)
 
plt.setp(p1.get_xticklabels(), rotation=90)

plt.title("Box-plot Time Series by Month",  fontsize = 17)

fig.savefig('pbox1.jpg', format='jpg', dpi=500)

plt.show()


fig, ax = plt.subplots(figsize=(13,7))

p1 = sns.boxplot(x="Mes-Año", y="Ventas", data=Time_Series_1_copy, order=Mes_Año_dt_sorted_sin_ceros, showfliers=False)
 
plt.setp(p1.get_xticklabels(), rotation=90)

plt.title("Box-plot Time Series by Month",  fontsize = 17)

fig.savefig('pbox2.jpg', format='jpg', dpi=500)

plt.show()




3 Bibliography